INTERSPEECH.2004 - Speech Recognition

Total: 182

#1 Stochastic gradient adaptation of front-end parameters

Authors: Sreeram Balakrishnan ; Karthik Visweswariah ; Vaibhava Goel

This paper examines how any parameter in the typical front end of a speech recognizer can be rapidly and inexpensively adapted with usage. It focusses firstly on demonstrating that effective adaptation can be accomplished using low CPU/memory cost stochastic gradient descent methods, secondly on showing that adaptation can be done at time scales small enough to make it effective with just a single utterance, and lastly on showing that using a prior on the parameter significantly improves adaptation performance on small amounts of data. It extends previous work on stochastic gradient descent implementation of fMLLR and work on adapting any parameter in the front-end chain using general 2nd order optimization techniques. The framework for general stochastic gradient descent of any front-end parameter with a prior is presented, along with practical techniques to improve convergence. In addition, methods for obtaining the alignment at small time intervals before the end of the utterance are presented. Finally, it is shown experimentally that online causal adaptation can result in a 5-15% WER reduction across a variety of problem sets and noise conditions, even with just 1 or 2 utterances of adaptation data.

#2 Maximum-likelihood adaptation of semi-continuous HMMs by latent variable decomposition of state distributions

Authors: Antoine Raux ; Rita Singh

Compared to fully-continuous HMMs, semi-continuous HMMs are more compact in size, require less data to train well and result in comparable recognition performance with much faster decoding speeds. Nevertheless, the use of semi-continuous HMMs in large vocabulary speech recognition systems has declined considerably in recent years. A significant factor that has contributed to this is that systems that use semi-continuous HMMs cannot be easily adapted to new acoustic (environmental or speaker) conditions. While maximum likelihood (ML) adaptation techniques have been very successful for continuous density HMMs, these have not worked to a usable degree for semi-continuous HMMs. This paper presents a new framework for supervised and unsupervised ML adaptation of semi-continuous HMMs, built upon the paradigm of probabilistic latent semantic analysis. Experiments with a specific implementation developed under this framework demonstrate its effectiveness.

#3 Transformation and combination of hidden Markov models for speaker selection training

Authors: Chao Huang ; Tao Chen ; Eric Chang

This paper presents a 3-stage adaptation framework based on speaker selection training. First, a subset of cohort speakers is selected for the test speaker using a Gaussian mixture model, which is more reliable given very limited adaptation data. Then cohort models are linearly transformed closer to each test speaker. Finally, the adapted model for the test speaker is obtained by combining these transformed models. Combination weights as well as bias items are adaptively learned from adaptation data. Experiments showed that model transformation before combination improves the robustness of the scheme. With only 30s of adaptation data, about 14.9% relative error rate reduction is achieved on a large vocabulary continuous speech recognition task.

#4 Improving eigenspace-based MLLR adaptation by kernel PCA

Authors: Brian Mak ; Roger Hsiao

Eigenspace-based MLLR (EMLLR) adaptation has been shown to be effective for fast speaker adaptation. It applies the basic idea of eigenvoice adaptation, and derives a small set of eigenmatrices using principal component analysis (PCA). The MLLR adaptation transformation of a new speaker is then a linear combination of the eigenmatrices. In this paper, we investigate the use of kernel PCA to find the eigenmatrices in the kernel-induced high dimensional feature space so as to exploit possible nonlinearity in the transformation supervector space. In addition, a composite kernel is used to preserve the row information in the transformation supervector which, otherwise, will be lost during the mapping to the kernel-induced feature space. We call our new method kernel eigenspace-based MLLR (KEMLLR) adaptation. On an RM adaptation task, we find that KEMLLR adaptation may reduce the word error rate of a speaker-independent model by 11%, and outperforms MLLR and EMLLR adaptation.

#5 Rapid acoustic model development using Gaussian mixture clustering and language adaptation

Authors: Nikos Chatzichrisafis ; Vasilios Digalakis ; Vasilios Diakoloukas ; Costas Harizakis

This work presents techniques for improved cross-language transfer of speech recognition systems to new, previously undeveloped, languages. Such techniques are particularly useful for target languages where minimal amounts of training data are available. We describe a novel method to produce a language-independent system by combining acoustic models from a number of source languages. This intermediate language-independent acoustic model is used to bootstrap a target-language system by applying language adaptation. For our experiments we use acoustic models of seven source languages to develop a target Greek acoustic model. We show that our technique significantly outperforms a system trained from scratch when less than 8 hours of read speech is available.

#6 Adaptation of front end parameters in a speech recognizer

Authors: Karthik Visweswariah ; Ramesh Gopinath

In this paper we consider the problem of adapting parameters of the algorithm used for extraction of features. Typical speech recognition systems use a sequence of modules to extract features which are then used for recognition. We present a method to adapt the parameters in these modules under a variety of criteria, e.g., maximum likelihood or maximum mutual information. This method works under the assumption that the functions that the modules implement are differentiable with respect to their inputs and parameters. We use this framework to optimize a linear transform preceding the linear discriminant analysis (LDA) matrix and show that it gives significantly better performance than a linear transform after the LDA matrix with small amounts of data. We show that linear transforms can be estimated by directly optimizing likelihood or the MMI objective without using auxiliary functions. We also apply the method to optimize the Mel bins, and the compression power in a system that uses power law compression.

#7 Speaker normalization through constrained MLLR based transforms

Authors: Diego Giuliani ; Matteo Gerosa ; Fabio Brugnara

In this paper, a novel speaker normalization method is presented and compared to a well known vocal tract length normalization method. With this method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space. Recognition experiments made use of two corpora, the first one consisting of adults' speech, the second one consisting of children's speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization method.

#8 Multi-layer structure MLLR adaptation algorithm with subspace regression classes and tying

Authors: Xiangyu Mu ; Shuwu Zhang ; Bo Xu

MLLR is a parameter transformation technique for both speaker and environment adaptation. When the amount of adaptation data is scarce, it is necessary to do adaptation with regression classes. In this paper, we present a rapid MLLR adaptation algorithm, which is called Multi-layer structure MLLR adaptation with subspace regression classes and tying (SRCMLLR). The method groups the Gaussians on a finer acoustic subspace level. The motivation is that clustering at subspaces of lower dimensions results in lower distortion, and there are fewer parameters to be estimated for the subsequent MLLR transformation matrix. On the other hand, the multi-layer structure generates a regression class dynamically for each subspace using the outcome of the former MLLR transformation. By using the transform structure, the computational load of performing the transformation is much reduced. Experiments in large vocabulary Mandarin speech recognition show the advantages of SRCMLLR over the traditional MLLR when the amount of adaptation data is scarce.

#9 Adaptation in the pronunciation space for non-native speech recognition

Authors: Georg Stemmer ; Stefan Steidl ; Christian Hacker ; Elmar Nöth

We introduce a new technique to improve the recognition of non-native speech. The underlying assumption is that for each non-native pronunciation of a speech sound, there is at least one sound in the target language that has a similar native pronunciation. The adaptation is performed by HMM interpolation between adequate native acoustic models. The interpolation partners are determined automatically in a data-driven manner. Our experiments show that this technique is suitable for both the off-line adaptation to a whole group of speakers as well as for the unsupervised online adaptation to a single speaker. Results are given both for spontaneous non-native English speech as well as for a set of read non-native German utterances.

#10 Robust ASR model adaptation by feature-based statistical data mapping

Authors: Xuechuan Wang ; Douglas O'Shaughnessy

Automatic speech recognition (ASR) model adaptation is important to many real-life ASR applications due to the variability of speech. Differences in speaker, bandwidth, context, channel, etc. between the speech databases used to train the initial ASR models and the application data can be major obstacles to the effectiveness of ASR models. ASR models, therefore, need to be adapted to the application environments. Maximum Likelihood Linear Regression (MLLR) is a popular model-based method mainly used for speaker adaptation. This paper proposes a feature-based Statistical Data Mapping (SDM) approach, which is more flexible than MLLR in various applications, such as different bandwidth and context. Experimental results on the TIMIT database show that ASR models adapted by the SDM approach have improved accuracy.

#11 A novel target-driven generalized JMAP adaptation algorithm

Authors: Zhaobing Han ; Shuwu Zhang ; Bo Xu

Adapting the parameters of a statistical speaker-independent continuous speech recognizer to the speaker can significantly improve the recognition performance and robustness of the system. In this paper, we propose a novel target-driven speaker adaptation method, Generalized Joint Maximum a Posteriori (GJMAP), which extends and improves the previously successful method JMAP. GJMAP partitions the HMM parameters with respect to the adaptation data, using a priori phonetic knowledge. Regression class trees are constructed dynamically on the target-driven principle in order to obtain the maximum increase of the auxiliary function. An off-line adaptation experiment on large vocabulary continuous speech recognition is carried out. The experimental results show that GJMAP has advantages over the conventional methods.

#12 Speedup of kernel eigenvoice speaker adaptation by embedded kernel PCA

Authors: Brian Mak ; Simon Ho ; James T. Kwok

Recently, we proposed an improvement to the eigenvoice (EV) speaker adaptation called kernel eigenvoice (KEV) speaker adaptation. In KEV adaptation, eigenvoices are computed using kernel PCA, and a new speaker's adapted model is implicitly computed in the kernel-induced feature space. Due to many online kernel evaluations, both adaptation and subsequent recognition of KEV adaptation are slower than EV adaptation. In this paper, we eliminate all online kernel computations by finding an approximate pre-image of the implicit adapted model found by KEV adaptation. Furthermore, the two steps of finding the implicit adapted model and its approximate pre-image are integrated by embedding the kernel PCA procedure in our new embedded kernel eigenvoice (eKEV) speaker adaptation method. When tested in a TIDIGITS task with less than 10s of adaptation speech, eKEV adaptation obtained a speedup of 6-14 times in adaptation and 136 times in recognition over KEV adaptation with 12-13% relative improvement in recognition accuracy.

#13 Maximum a posteriori eigenvoice speaker adaptation for Korean connected digit recognition

Authors: Hyung Bae Jeon ; Dong Kook Kim

In this paper, we present a maximum a posteriori (MAP) eigenvoice speaker adaptation approach for a self-adaptation system. The proposed MAP eigenvoice approach is developed by introducing a probability density for the eigenvoice coefficients, as in MAPLR adaptation. We build a self-adaptation system that is useful to public users, because the user does not need to speak several sentences for adaptation. In the self-adaptation system we use only the one utterance that is to be recognized, so we use eigenvoice adaptation with the MAP criterion, which is the most robust adaptation algorithm for very small amounts of adaptation data. In a series of self-adaptation experiments on the Korean connected digit recognition task, we demonstrate that the proposed approach achieves good performance with a very small amount of adaptation data.

#14 Vocal tract normalization based on spectral warping

Authors: Wei Wang ; Stephen Zahorian

Two techniques for speaker adaptation based on frequency scale modifications are described and evaluated. In one method, minimum mean square error matching is performed between a spectral template for each speaker and a "typical speaker" spectral template. One parameter, a warping factor, is used to control the spectral matching. In the second method, a neural network classifier is used to adjust the frequency warping factor for each speaker so as to maximize vowel classification performance for each speaker. A vowel classifier trained only with normalized female speech and tested only with normalized male speech, or vice versa, is nearly as accurate as when speaker genders are matched for training and testing, and the speech is not normalized. The improvement due to normalization is much smaller if training and test data are matched. The normalization based on classification performance is superior to that based on minimizing mean square error.

#15 Acoustic model adaptation for coded speech using synthetic speech

Authors: Koji Tanaka ; Fuji Ren ; Shingo Kuroiwa ; Satoru Tsuge

In this paper, we describe a novel acoustic model adaptation technique which generates a "speaker-independent" HMM for the target environment. Recently, personal digital assistants like cellular phones are shifting to IP terminals. The encoding-decoding process utilized for transmitting over IP networks deteriorates the quality of speech data. This deterioration causes degradation in speech recognition performance. Acoustic model adaptation can improve recognition performance. However, the conventional adaptation methods usually require a large amount of adaptation data. The proposed method uses HMM-based speech synthesis to generate adaptation data from the acoustic model of an HMM-based speech recognizer, and consequently does not require any speech data for adaptation. Experimental results on G.723.1 coded speech recognition show that the proposed method improves speech recognition performance. A relative word error rate reduction of approximately 12% was observed.

#16 Speaker adaptation method for CALL system using bilingual speakers' utterances

Authors: Motoyuki Suzuki ; Hirokazu Ogasawara ; Akinori Ito ; Yuichi Ohkawa ; Shozo Makino

Several CALL systems use two acoustic models to evaluate a learner's pronunciation. To achieve high evaluation performance, a speaker adaptation method is introduced into the CALL system. It requires adaptation data in the target language; however, a learner cannot pronounce the target language correctly. In this paper, we propose two new speaker adaptation methods for CALL systems. The new methods require only the learner's utterances in the native language for adaptation. The first method is an algorithm to adapt acoustic models using bilingual speakers' utterances. The speaker-independent acoustic models of the native and target languages are first adapted to a bilingual speaker, and then adapted to the learner using the learner's speech in the native language. Phoneme recognition accuracy is about 5% higher than the baseline method. The second method is a training algorithm for an acoustic model. It can robustly train a bilinguals' model from a few bilinguals' utterances. Phoneme recognition accuracy is about 10% higher than the baseline method.

#17 Acoustic model adaptation based on coarse/fine training of transfer vectors and its application to a speaker adaptation task

Author: Shinji Watanabe

In this paper, we propose a novel adaptation technique based on coarse/fine training of transfer vectors. We focus on transfer vector estimation of a Gaussian mean from an initial model to an adapted model. The transfer vector is decomposed into a direction vector and a scaling factor. By using tied-Gaussian class (coarse class) estimation for the direction vector, and by using individual Gaussian class (fine class) estimation for the scaling factor, we can obtain accurate transfer vectors with a small number of parameters. Simple training algorithms for transfer vector estimation are analytically derived using the variational Bayes, maximum a posteriori (MAP) and maximum likelihood methods. Speaker adaptation experiments show that our proposals clearly improve speech recognition performance for any amount of adaptation data, compared with conventional MAP adaptation.

#18 Speaker clustering of speech utterances using a voice characteristic reference space

Authors: Wei-Ho Tsai ; Shih-Sian Cheng ; Hsin-Min Wang

This paper presents an effective technique for clustering speech utterances based on their associated speaker. In attempts to determine which utterances are from the same speakers, a prerequisite is to measure the similarity of voice characteristics between utterances. Since the vast majority of existing methods evaluate the inter-utterance similarity by taking only the information from the spectrum-based features of utterance pairs into account, the resulting clusters may not correspond well to speakers, but instead to environmental conditions or other acoustic classes. To compensate for this shortcoming, this study proposes to project utterances from their spectrum-based feature representation onto a reference space trained to cover the generic voice characteristics inherent in all of the utterances to be clustered. The resultant projection vectors naturally reflect the relationships between all the utterances and are more robust against interference from non-speaker factors. As examples, we present three distinct implementations for reference space creation.

#19 Performance improvement of connected digit recognition using unsupervised fast speaker adaptation

Authors: Young Kuk Kim ; Hwa Jeon Song ; Hyung Soon Kim

In this paper, we investigate unsupervised fast speaker adaptation based on eigenvoice to improve the performance of Korean connected digit recognition over the telephone channel. In addition, utterance verification is introduced into speaker adaptation to examine whether an input utterance is appropriate for adaptation. Performance evaluation showed that the proposed method yielded performance improvements. We obtained 18%-22% string error reduction by the N-best-based fast speaker adaptation method with utterance verification using a Support Vector Machine.

#20 Simultaneous estimation of weights of eigenvoices and bias compensation vector for rapid speaker adaptation

Authors: Hyung Soon Kim ; Hwa Jeon Song

The eigenvoice-based speaker adaptation method is known to be a very effective tool for rapid speaker adaptation. The stochastic matching approach is also known as a powerful method to reduce the mismatch between training and test environments. In this paper, we apply the two methods simultaneously, for speaker adaptation and environment compensation, based on the eigenvoice adaptation framework. In experiments on a vocabulary-independent word recognition task with supervised adaptation, the proposed method shows a higher performance improvement than the conventional eigenvoice adaptation method with a small amount of adaptation data. With the proposed method, we obtained 19-30% relative improvement with only a single adaptation utterance and 37% relative improvement with 50 adaptation utterances.

#21 Speaker dependent model order selection of spectral envelopes

Author: Matthias Wölfel

This work introduces a maximum-likelihood based model order (MO) selection technique for spectral envelopes to apply speaker dependent adaptation in the feature space, similar to vocal tract length normalization. Speech recognition systems based on spectral envelopes use a fixed MO for the underlying linear parametric model. Using a fixed MO over different speakers or channels might not be optimal. To address this problem we investigated the use of warped and scaled minimum variance distortionless response spectral estimation techniques with speaker dependent MOs based on a maximum-likelihood criterion. Comparing experimental results on the Translanguage English Database, we show a relative improvement in word error rate of 1.9% over the fixed MO and 3.5% over traditional Mel-frequency cepstral coefficients.

#22 Methods for task adaptation of acoustic models with limited transcribed in-domain data

Authors: Enrico Bocchieri ; Michael Riley ; Murat Saraclar

Application-specific acoustic models provide the best recognition accuracy, but they are expensive to train, because they require the transcription of large amounts of in-domain speech. This paper focuses on acoustic model estimation given limited in-domain transcribed speech data and large amounts of transcribed out-of-domain data. First, we evaluate several combinations of known methods to optimize the adaptation/training of acoustic models on the limited in-domain speech data. Then, we propose Gaussian sharing to combine in-domain models with out-of-domain models, and a data generation process to simulate the presence of more speakers in the in-domain data. In a spoken language dialog application, we contrast our methods against an upper accuracy bound of 69.1% (model trained on a large amount of in-domain data) and a lower bound of 60.8% (no in-domain data). Using only 2 hours of in-domain speech for model estimation, we improve the accuracy by 5.1% (to 65.9%) over the lower bound.

#23 Unsupervised topic adaptation for lecture speech retrieval

Authors: Atsushi Fujii ; Tetsuya Ishikawa ; Katsunobu Itou ; Tomoyosi Akiba

We are developing a cross-media information retrieval system, in which users can view specific segments of lecture videos by submitting text queries. To produce a text index, the audio track is extracted from a lecture video and a transcription is generated by automatic speech recognition. In this paper, to improve the quality of our retrieval system, we extensively investigate the effects of adapting acoustic and language models on speech recognition. We perform an MLLR-based method to adapt an acoustic model. To obtain a corpus for language model adaptation, we use the textbook for a target lecture to search a Web collection for the pages associated with the lecture topic. We show the effectiveness of our method by means of experiments.

#24 Mean and covariance adaptation based on minimum classification error linear regression for continuous density HMMs

Authors: Haibin Liu ; Zhenyang Wu

The performance of a speech recognition system deteriorates significantly because of mismatches between training and testing conditions. This paper addresses the problem and proposes an algorithm to adapt the mean and covariance of the HMM simultaneously within the minimum classification error linear regression (MCELR) framework. Rather than estimating the transformation parameters using maximum likelihood estimation (MLE) or maximum a posteriori estimation, we propose to use minimum classification error (MCE) as the estimation criterion. The proposed algorithm, called IMCELR (Improved MCELR), has been evaluated on a Chinese digit recognition task based on continuous density HMMs. The experiments show that the proposed algorithm is more efficient than maximum likelihood linear regression with the same amount of adaptation data.

#25 Design of ready-made acoustic model library by two-dimensional visualization of acoustic space

Authors: Goshu Nagino ; Makoto Shozakai

This paper proposes a technique enabling the design of a ready-made library composed of high-performance, small-size acoustic models, utilizing a method of visualizing multiple HMM acoustic models in two-dimensional space (the COSMOS method: aCOustic Space Map Of Sound), and providing one of these models without overburdening users. The acoustic space (as expressed in multi-dimensional feature parameters) is partitioned into zones in two-dimensional space, allowing for the creation of highly precise acoustic models through the generation of acoustic models for the respective zones of the acoustic space. A set of these acoustic models is called an acoustic model library. In the experiment in this paper, a plotted map (called the COSMOS map) featuring a total of 145 male speakers speaking in various styles was generated utilizing the COSMOS method. Through the COSMOS map, the distribution of each speaking style and the relationship between the position of the speaker on the COSMOS map and the speech recognition performance were analyzed, thereby demonstrating the effectiveness of the COSMOS method in the analysis of acoustic space. The COSMOS map was then partitioned into concentric acoustic space zones to produce acoustic models representing each acoustic space zone. When the acoustic model providing the maximum likelihood score was selected using voice samples consisting of 5 words, the acoustic model, even if expressed as a single Gaussian distribution, showed high performance comparable to a speaker-independent acoustic model (called the SI-model) expressed as 16-mixture Gaussian distributions. Furthermore, the acoustic model showed higher performance than the SI-model adapted with voice samples of 30 words by the MLLR [2] method.